Shareware Grab Bag

home *** CD-ROM | disk | FTP | other *** search

/ Shareware Grab Bag / Shareware Grab Bag.iso / 090 / cmln1285.arc / DADA.DOC < prev next >

Wrap

Text File | 1986-02-27 | 16KB | 319 lines

This file offers additional comments, notes and hints on the Dada compiler given in DADA.PAS and described in Computer Language for December, 1985. 1. Several of the types and variables that are defined globally in DADA.PAS might better be declared as typed constants under Turbo Pascal. They would then be local to the procedures that refer to them and hence out of harm's way if some other procedure starts meddling where it doesn't belong. They must be typed "constants" rather than variables because they are static quantities: an error message, for example, must retain its value between calls to the error-listing routine. An added convenience of the Turbo typed constants is that they can be initialized at the time they are declared; thus the procedures InitKeywords and InitErrorList could be discarded. I avoided using typed constants because they are not a feature of standard Pascal. The following global variables (and their associated defined types) might be made local in this way: OutLine, OutPoint, OutBuf CH LineCount TypeSet, TFset, RelOpSet, AddOpSet, MultOpSet CurrentScope Keywords ErrorList 2. For simplicity I have defined the symbol table as a linked linear list of invariant records. Several improvements could be consid- ered. A binary tree structure would speed up the compiler; a hash table would go even faster. The only routines that would have to be changed are Find, Declare and Blot. It hardly seems worth the trouble, however; Dada is a toy language, and the compiler is unlikely to be given long programs, where speed of compilation would become an issue. If space rather than time is critical, a variant record could be defined, with SymClass as the tag field. The VarType field is needed only for variable entries and the Scope field only for procedure entries. The savings, however, amount to only one byte for VarType and two for Scope. 3. Doing a better job with errors is a lot of work. Stopping at the first error is acceptable in a compiler that has a built-in editor, where the position of the error can be indicated more or less exactly on screen. In a batch-style compiler, the favored strategy is to continue reading the source text, so that a single compilation will in principle catch errors throughout the program. The challenge is to recover from the first error well enough to continue without issuing a Niagara of spurious error messages. The Dada compiler is not capable of recovery. Removing the Halt state- ment from the Error procedure will demonstrate this effectively. 4. The error-checking done here uses compiler directives peculiar to Turbo Pascal. The main idea is that {$I-} allows the program to continue after an input-output error, leaving an error code in the system variable IoResult. If the operation is successful, IoResult is zero. For other Pascal compilers equivalent services should be available. A fuller and friendlier file-handling routine would report the nature of the error, offer the user a second chance, supply a default name for the output file, warn if an existing file will be overwritten, etc. A more convenient way of getting the file names would be to retrieve them from the the command line. I decided against doing so because the methods of accessing command-line arguments vary between Pascal compilers and between operating systems. 5. The simplicity of the scanner owes much to the simplicity of the language being recognized, but it is also made possible in part by delaying certain lexical decisions until the syntactic or even the semantic phase of program analysis. The scanner returns the same code (Ident) for all identifiers, whether they refer to variables or to procedures. The distinction between variable and procedure names is really a lexical one, but it is not made until the parser looks up the identifier in the symbol table. Postponing this part of the lexical analysis allows the scanner to operate without either lookahead of backtracking. At any moment, the scanner knows only the current input symbol (the content of variable CH), and once it has consumed that character it can never go back to read it again. In many compilers the scanner has broader responsibilities and is consequently more complex. For example, if the scanner (rather than the parser) installs identifiers in the symbol table, they must be classified unambiguously at the outset. In Dada this might be done by looking ahead, after finding an identifier, to see if it is followed by an assignment operator, which would imply that the identifier is (or ought to be) a variable name. If the assignment operator were not found, the pointer to the input stream would have to be backed up to the character immediately following the identifier. 6. The organization of the symbol table and the three routines that access the table are based on the most-closely-nested rule of Pascal and other block-structured languages. Since the table is searched from youngest to oldest, a call to Find(IdentifierName) will always return a pointer to the most recent entry that matches IdentifierName. This allows the compiler to distinguish between identifiers that have the same name but differ in scope. Suppose a Dada program makes the following declarations, causing the parser to make the symbol-table calls shown. For each line the value of CurrentScope is given in brackets: procedure Proc1; [1] { Declare(Proc1) } procedure Proc1A; [2] { Declare(Proc1A) } begin [3] end; [2] { Blot } procedure Proc1B; [2] { Declare(Proc1B) } procedure Proc1Ba; [3] { Declare(Proc1Ba) } begin [4] end; [3] { Blot } begin [3] end; [2] { Blot } begin [2] end; [1] { Blot } procedure Proc2; [1] { Declare(Proc2) } begin [2] end; [1] { Blot } The corresponding symbol-table entries would be built up by Declare and subsequently eliminated by Blot in the following order: NAME SCOPE NEXT CURRENTSCOPE FirstSym --> Proc1 1 ^<earlier> 1 Proc1 1 ^<earlier> FirstSym --> Proc1A 2 ^Proc1 2 Proc1 1 ^<earlier> Proc1A 2 ^Proc1 FirstSym --> Proc1B 2 ^Proc1A 2 Proc1 1 ^<earlier> Proc1A 2 ^Proc1 Proc1B 2 ^Proc1A FirstSym --> Proc1Ba 3 ^Proc1B 3 Proc1 1 ^<earlier> Proc1A 2 ^Proc1 FirstSym --> Proc1B 2 ^Proc1A 2 Proc1 1 ^<earlier> FirstSym --> Proc2 1 ^Proc1 1 The appending of a symbol to the table by Declare is straight- forward. A new instance of the record structure is created, then the Next field of the new record is made to point to the current FirstSym, and finally FirstSym is adjusted to point to the new record. Thus each new record is put at the "front" of the list. The culling of old symbols by Blot is not quite as obvious. Each name must be made inaccessible once the program passes beyond the scope of that name. In the fragment above, for example, Proc2 should not be able to call Proc1A, Proc1B or Proc1Ba because those procedures are local to Proc1. In Dada this is accomplished by eradicating the symbol-table entry for a procedure once the "end" that marks the limit of its scope is reached. The global variable CurrentScope is incremented each time a block is entered; Blot decrements CurrentScope and then works up the list, unlinking any entries whose scope is numerically greater than CurrentScope. Two further notes are needed. First, observe that "end" can appear in two contexts in Dada (or Pascal): at the end of a block or at the end of a compound statement. Blot must be called only on exiting a block, not on completing a compound statement. Second, the technique of making symbol-table entries inaccessible by simply erasing them may not work in more elaborate compilers. If there is an optimizing pass, for example, it will probably need to consult the symbol table. The proper approach then is to include a flag field in the record, so that a procedure like Blot can mark a name as hidden without destroying the entry. This also allows better error messages--"Variable referenced outside its scope" rather than "Undeclared variable." 7. Forth systems vary a good deal, not only in the language they accept but also in the file format they expect. The code generator given here creates "screen" files made up of 1,024-byte blocks. Each such block is conventionally viewed as a screen of 16 lines with 64 characters per line. There are no end-of-line delimiters; every character position that is not occupied by Forth text is filled with an ASCII blank. Input in this form should be accept- able to many Forth systems, provided they use the operating-system disk format. The published routines were tested with versions of PC/FORTH, from Laboratory Microsystems, Inc. The plasticity and mutability of Forth itself raises other problems. The various standards (fig-Forth, Forth-79, Forth-83) differ appreciably, and there are differences even among systems that are meant to conform to the same standard. I have tried to confine myself to Forth words that are common to all systems, but no doubt incompatibilities will turn up. The code generator can produce the following Forth words: ( ) FORTH DEFINITIONS DECIMAL CONSTANT VARIABLE : ; ;S --> @ ! . CR MINUS 0= = < > DUP DROP SWAP OVER KEY EMIT IF ELSE THEN BEGIN WHILE REPEAT OR AND - * + / MOD In addition, the following synonyms are defined in the header that precedes every program: 1 CONSTANT TRUE 0 CONSTANT FALSE : NEGATE MINUS ; : NOT 0= ; : <> = NOT ; : >= < NOT ; : <= > NOT ; If any of this will make your Forth system choke, change it, or add definitions to the header to provide compatibility. 8. The Dada compiler employs the parsing technique called recursive descent, which directly reflects the recursive definition of the language syntax. The program itself also illustrates this structure: the routines are nested so that each one is accessible only to the nest-higher-level routine that calls it. The overall structure looks like this: ParseProgram ParseVariables ParseBlock ParseStatement ParseExpression ParseSimpleExpr ParseTerm ParseSignedFactor ParseFactor Thus to recognize a program the parser must find variable declarations and a block; in recognizing a block it tries to resolve the input stream into statements, and so on. The recursive structure takes care of much of the bookkeeping that would otherwise burden the program (and the programmer). This is clearest in the parsing of expressions. In Forth as well as in most machine languages expressions must be given in postfix notation, whereas Dada and other "algebraic" languages use infix notation. Making the conversion requires maintaining a stack of pending operators. For example, the infix expression 3 * (4 + 5) must be converted to the postfix 3 4 5 + *; to do so the operators "+" and "*" must be stacked until both their operands have been read from the input stream. Examination of the parsing routines reveals no stack-management apparatus. How is it done? The stack employed is the return stack created by the storage-allocation subsystem of the Pascal com- piler. Although HoldMultOp, for example, is defined as a single variable, the executing program can actually create many instances of it. Every time ParseTerm is called, a new instance of HoldMultOp is pushed onto the return stack. One final note on the parser: Just as the scanner can see only the current character in the source file, the parser has access only to the current token in the input stream. There is no explicit lookahead or backtracking. Whereas the scanner really does its analysis one symbol at a time, however, the parser saves a great deal of information on its (hidden) stack. Indeed, the first identifier in a program--the name of the program itself--is saved until the main block at the very end of the program is reached. 9. The type checking done in DADA.PAS is not quite complete. The operators are classified as RelOps, AddOps and MultOps according to their precedence relations--multiplication must be done before addition and addition before comparison. There is another dimen- sion of classification, however, that the compiler ignores. Log- ical OR has the same precedence as multiplication, and logical AND is at the same level as addition, but the logical and arithmetic operations properly apply to values of different types. As the program stands, the compiler makes certain that the two operands in each operation have the same type, but there is no assurance the operator is appropriate. Thus "6 + FALSE" would generate an error, but "6 AND 7" or "TRUE * FALSE" would pass without com- plaint. Because Forth encodes boolean values as integers, such expressions will also execute without causing a run-time error. The remedy is straightforward, although it makes the code somewhat bulkier. ParseSimpleExpr and ParseTerm must be made to analyze logical and arithmetic expressions separately. Both operands of AND and OR must be boolean values, and both operands of +, -, * and / must be integers. In ParseExpression some decisions need to be made. Boolean values can surely be compared for equality, but is "TRUE > FALSE" a meaningful expression? All this becomes still more complicated when additional data types, such as strings and characters, are introduced. 10. There is a minor discrepancy between grammar and compiler in the function ParseStatement. The grammar for Dada, following that for Pascal, defines the semicolon as a statement separator, not a statement terminator. Hence the proper syntax of a compound state- ment is: begin A := B; B := C; C := D end where there is no semicolon following the last simple statement and preceding the "end". It's hard enough, however, remembering where to put semicolons without also remembering special rules about where not to put them. Many Pascal compilers therefore make the final semicolon optional--unneeded but harmless if present. The same policy is adopted in Dada.